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In this paper we present several of the salient theoretical and 
practical issues associated with modeling a speech signal as a prob- 
abilistic function of a (hidden) Markov chain. First we give a concise 
review of the literature with emphasis on the Baum-Welch algorithm. 
This is followed by a detailed discussion of three issues not treated in 
the literature: alternatives to the Baum-Welch algorithm; critical 
facets of the implementation of the algorithms, with emphasis on 
their numerical properties; and behavior of Markov models on certain 
artificial but realistic problems. Special attention is given to a par- 
ticular class of Markov models, which we call "left-to-right" models. 
This class of models is especially appropriate for isolated word 
recognition. The results of the application of these methods to an 
isolated word, speaker-independent speech recognition experiment 
are given in a companion paper. 

I. INTRODUCTION 

It is generally agreed that information in the speech signal is encoded 
in the temporal variation of its short-duration power spectrum. To 
decode the signal, then, requires techniques for both estimation of 
power spectra and tracking their changes in time. This paper is 
concerned with the application of the theory of probabilistic functions 
of a (hidden) Markov chain to modeling the inherent nonstationarity 
of the speech signal for the purposes of automatic speech recognition 

(ASR). 

The use of hidden Markov models for ASR was proposed by Baker 1 ' 2 
and, independently, by a group at IBM. 3-12 The theory on which their 
work rests is due to Baum et al. 13 " 17 Its first appearance in the literature 
occurred several years before Baker's studies and has since been 
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explored in some detail. 1819 Our previous work in ASR has used 
temporal alignment procedures based on dynamic programming tech- 
niques, 20 and we hoped that through studying the new (to us) body of 
material we could improve the performance and/or capabilities of our 
present ASR systems. 

Our initial goal, therefore, was to understand the theory of hidden 
Markov models sufficiently well to enable us to implement a new ASR 
system that could be compared directly to our existing ones. We have, 
in fact, been able to accomplish that goal, and a description and the 
results of our experiments are reported in a companion paper. 21 In the 
course of our studies, we have collected and integrated a number of 
loosely related mathematical techniques pertinent to Markov model- 
ing. We have also modified and adapted these techniques to the 
specific ASR problems we wished to study. Our purpose in writing this 
tutorial, then, is to present this synthesis in a way that will be 
enlightening to those not familiar with the theory of hidden Markov 
models. We also hope that this treatment will provide for a better 
understanding of our companion paper. Finally, we hope to make the 
presentation general enough so that the theory is seen to be applicable 
to more than the problem of ASR. 

We shall proceed as follows. We begin by defining probabilistic 
functions of a (hidden) Markov 22 chain and then show how they may 
be used in a natural way to model the speech signal. Once this is done, 
our task is reduced to solving two specific and well-defined mathemat- 
ical problems: (i) computing the parameters of a proposed model 
conditioned on a sequence of observations assumed to have been 
generated by the model, and (ii) calculating the probability that a 
given set of observations was produced by a particular model. 

First we review the solution to these problems as originally given by 
Baum, 13 who treated them as problems in statistical estimation. Lest 
the problems be too narrowly construed, we look at them as problems 
of classical constrained optimization. This allows us to give a partial 
geometrical interpretation to the Baum- Welch algorithm and to relate 
it to other studies of the problem by Baum and Eagon 14 and Baum 
and Sell. 17 It also makes clear that there are other methods of solution 
available that may, in certain instances, have advantages over the 
Baum- Welch algorithm. Finally, we discuss the dynamic programming 
algorithm of Viterbi 23 as an alternative to the so-called "forward- 
backward" method of Baum 13 for computing the probability of a 
sequence of observations conditioned on a specific model. 

The treatment of these problems in the existing literature, and as 
recounted here, can lull a prospective user of the theory into a false 
sense of security. The equations look innocuous enough but, in reality, 
they overlook two problems that, though uninteresting from a theo- 
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retical standpoint, are of great significance for a robust implementa- 
tion. We believe it is worthwhile to address, first, a numerical problem 
arising from the evaluation of certain frequently occurring algebraic 
expressions, and then an experimental difficulty precipitated by the 
inescapable reality of finite training-data set. 

The numerical problem arises because, regardless of the method of 
solution chosen, one is required to evaluate a product of stochastic 
matrices involving a number of factors proportional to the number of 
observations. In any real computer, this will ultimately result in 
underflow. Fortunately, the computation can be scaled using a tech- 
nique that subsequently will be seen to have some very useful prop- 
erties. 

The problem of insufficient training data can be ameliorated by 
changing the constraints on the optimization problem. This can be 
simply and directly accomplished in the classical methods. We show 
that the Baum- Welch algorithm, too, can be modified to produce the 
same result. Both of these methods appear to be simpler in implemen- 
tation than the technique proposed by Jelinek and Mercer. 9 

Finally, under the heading of implementational considerations, we 
discuss techniques for model averaging. These can be used both for 
block processing of observations in case one is subject to storage 
limitations, and for increasing model stability under some circum- 
stances. 

The speech recognition experiment that we had in mind was on a 
speaker-independent, isolated word recognition system with a small 
vocabulary. Oddly enough, this is a simpler task than those to which 
the theory had already been applied by Baker 1 and the IBM group. 3 " 12 
Perhaps our choice of a manageable problem is responsible for the 
degree of success reported in the companion paper. 21 We determined 
that for our ASR task it is advantageous to use a particular kind of 
hidden Markov model, which we call a left-to-right model. In such a 
model there are strong temporal constraints on the Markov chain. 
First, any state, once left, cannot be later revisited. Second, there is a 
final absorbing state in which all observation sequences are assumed 
to terminate. These restrictions on the sequences affect both parame- 
ter estimation and probability computation procedures. We show how 
the Baum- Welch algorithm, the Viterbi algorithm, and the classical 
methods can all be adapted for use with left-to-right models. 

We conclude our presentation with several sample solutions to some 
artificial but nontrivial problems that illustrate concepts treated in the 
foregoing discussion. 

II. A REVIEW OF THE THEORY 

A probabilistic function of a (hidden) Markov chain is a stochastic 
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process generated by two interrelated mechanisms, an underlying 
Markov chain having a finite number of states, and a set of random 
functions, one of which is associated with each state. At discrete 
instants of time, the process is assumed to be in some state and an 
observation is generated by the random function corresponding to the 
current state. The underlying Markov chain then changes states ac- 
cording to its transition probability matrix. The observer sees only the 
output of the random functions associated with each state and cannot 
directly observe the states of the underlying Markov chain; hence the 
term hidden Markov model. 

In principle, the underlying Markov chain may be of any order and 
the outputs from its states may be multivariate random processes 
having some continuous joint probability density function. In this 
discussion, however, we shall restrict ourselves to consideration of 
Markov chains of order one, i.e., those for which the probability of 
transition to any state depends only upon that state and its predeces- 
sor. We shall also limit the discussion to processes whose observations 
are drawn from a discrete finite alphabet according to discrete proba- 
bility distribution functions associated with the states. 

It is quite natural to think of the speech signal as being generated 
by such a process. We can imagine the vocal tract as being in one of a 
finite number of articulatory configurations or states. In each state a 
short (in time) signal is produced that has one of a finite number of 
prototypical spectra depending, of course, on the state. Thus, the 
power spectra of short intervals of the speech signal are determined 
solely by the current state of the model, while the variation of the 
spectral composition of the signal with time is governed predominantly 
by the probabilistic state transition law of the underlying Markov 
chain. For speech signals derived from a small vocabulary of isolated 
words, the model is reasonably faithful. The foregoing is, of course, an 
oversimplification intended only for the purpose of motivating the 
following theoretical discussion. 

Let us say that the underlying Markov chain has N states q\, q%, 
• • • , qN and the observations are drawn from an alphabet, V, of M 
prototypical spectra, V\,V2, • • • , vm . The underlying Markov chain can 
then be specified in terms of an initial state distribution vector it' = 
(iti, TT2, • • • , ttn) and a state transition matrix, A = [a,y] 1 < i, j < N. 
Here, m is the probability of <?, at some arbitrary time, t = 0, and ay is 
the probability of transiting to state qj given current state, qt, that is 
dij = prob(qj at t + l\qt at t). 

The random processes associated with the states can be collectively 
represented by another stochastic matrix B = [&/*] in which for 1 < j 
< N and 1 < k < M, bjh is the probability of observing symbol Vk given 
current state qy. We denote this as bjk — prob(i>* at t\qj at t). Thus a 
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hidden Markov model, M, is identified with the parameter set 
W,A,B). 

To use hidden Markov models to perform speech recognition we 
must solve two specific problems: observation sequence probability 
estimation, which will be used for classification of an utterance; and 
model parameter estimation, which will serve as a procedure for 
training models for each vocabulary word. Both problems proceed 
from a sequence, O, of observations O1O2 • • • Ot where each O t for 
1 < t < T is some v k E V. 

Our particular classification problem is as follows. We wish to 
recognize utterances known to be selected from some vocabulary, W, 
of words iii, life, ••• , Wv. We are given an observation sequence, O, 
derived from the utterance of some unknown Wi E W and a set of V 
models Mi , M 2 , • • • , My. We must compute P, = prob(0 1 M.) for 1 < 
i < V. We will then classify the unknown utterance as Wi iff P, > Pj for 
l<y'<V. 

The training problem is simply that of determining the models 
Mj = (tt,, At, Bi) for 1 < i < V given training sequences O n) , (2) , 

• • • , O m , where (i) is known to have been derived from an utterance 
of word wi for 1 < i < V. 

One could, in principle, compute prob(0 1 M) by computing the joint 
probability prob (O, s|M) for each state sequence, s, of length T, and 
summing over all state sequences. Obviously this is computationally 
intractable. Fortunately, however, there is an efficient method for 
computing P. Let us define the function a t (i ) for 1 < t < T as prob(Oi0 2 

• • • O t and qi at t\M). According to the definition ai(i) = irtbt(Oi), 
where 6,(Oi) is understood to mean bik iff Oi = v k \ then we have the 
following recursive relationship for the "forward probabilities": 

N 

S at(i)aij 



<Xt + l(j) = 



bj(0 M ) lStsT-L (1) 



Similarly, we define another function, fit(j) = prob(0/+iO/+2 • • • 
Or\qj at t and M). We set faU) -IV; and then use the backward 
recursion 

Mi) = I aijbj(O t+1 )p t+ iU) T - 1 > t > 1 (2) 

7-1 

to compute the "backward probabilities." 

The two functions can be used to compute P according to 

N N 

P = Prob(0|M) = n *t{i)aiMO t +x)(St + i(j) (3) 

i=l y=l 

for any t such that 1 < t < T - 1. Equations (1) to (3) are from Baum 13 
and are sometimes referred to as the "forward-backward" algorithm. 
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Setting t = T - 1 in (3) gives 

P = I <xr(i) (4) 

i-i 

so that P can be computed from the forward probabilities alone. A 
similar formula for P can be obtained from the backward probabilities 
by setting t = 1. These and several other formulas in this section may 
be compactly written in matrix notation (see Appendix A). For in- 
stance, 

P = ir'BiABzA • • • AB T 1, (5) 

where 1 is the iV-vector (1, 1, 1, • • • , 1)' and 

/bi(O t ) 

B t -l *<°'> . ° J (6) 

\ O b N (O t )> 

for 1 < t < T. From (5) it is clear that P is a homogeneous polynomial 
in the irt, dij, and b//,. Any of eqs. (3) through (5) may be used to solve 
the classification problem. The forward and backward probabilities 
will prove to be convenient in other contexts. 

When we compute P with the forward-backward algorithm, we are 
including the probabilities of all possible state sequences that may 
have generated O. Alternatively, we may define P as the maximum 
over all state sequences i = io, d, •■• ir of the joint probability 
P(0, i). This distinguished state sequence and the corresponding 
probability of the observation sequence can be simultaneously com- 
puted by means of the Viterbi 23 algorithm. This dynamic programming 
technique proceeds as follows: Let <f>i(i) = w,-6,-(Oi) for l<i<N. Then 
we can perform the following recursion for 2 < t < T and 1 < j < N 

<)>t(j) = max[4>t-i(i)aij]bj(O t ) (7a) 

and 

Uj) = i*, (7b) 

where i* is a choice of an index i that maximizes <fc-i(()< 
The result is that P = max [<Mi)]- Also the maximum likelihood 

state sequence can be recovered from \p as follows. Let qr= i*, where 
i* maximizes P. Then for T > t > 2, q t -\ = ipt(qt). If one only wishes to 
compute P, the linked list, \p, need not be maintained as in (7b). Only 
the recursion (7a) is required. 

The problem of training a model, unfortunately, does not have such 
a simple solution. In fact, given any finite observation sequence as 
training data, we cannot optimally train the model. We can, however, 
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choose m, A, and B such that prob(0 |M) is locally maximized. For an 
asymptotic analysis of the training problem the reader should consult 
Baum and Petrie. 15 

We can use the forward and backward probabilities to formulate a 
solution to the problem of training by parameter estimation. Given 
some estimates of the parameter values we can compute, for example, 
that the expected number of transitions, yij, from qi to qj, conditioned 
on the observation sequence is just 

1 T-l 

Yv = iE at(i)aijbj(O t+l )p t+1 (j). (8) 

Then, the expected number of transitions, y, out of qi, given O, is 

the last step of which is based on (2). The ratio yg/yi is then an 
estimate of the probability of state qj, given that the previous state 
was qt. This ratio may be taken as a new estimate, dg, of ag. That is, 

T-l 

£ a t (i)aijbj(O t+1 )p t+1 U) 

- _ fit _ t-i nm 

a.ij = — — jrn . (10) 

t-i 

Similarly, we can make a new estimate of bjk as the frequency of 
occurrence of Vk in qj relative to the frequency of occurrence of any 
symbol in state qj. Stated in terms of the forward and backward 
probabilities we have 



\Oi=vk 



&-s¥=_ -. (id 



Finally, new values of the initial state probabilities may be obtained 
from 

■)-p«i<s)0i(i). (12) 

As we shall see in the next section, the reestimates are guaranteed 
to increase P, except at a critical point. 

2. 1 Proof of the reestimation formula 

The reestimation formulas (10), (11), and (12) are instances of the 
Baum- Welch algorithm. Although it is not at all obvious, each appli- 
cation of the formulas is guaranteed to increase P except if we are at 
a critical point of P, in which case the new estimates will be identical 
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to their current values. Several proofs of this rather surprising fact are 
given in the literature. 13,17 Because we shall need to modify it later to 
cope with the finite sample size problem, we shall briefly sketch 
Baum's proof 13 here. The proof is based on the following two lemmas: 
Lemma 1: Let Ui, i = 1, • • • , S be positive real numbers, and let Vt, 
i = 1, ••• , S be nonnegative real numbers such that £, Vt > 0. Then 
from the concavity of the log function it follows that 



inlls'l-ln 



2> 



Ui 



*1 



\ k j 
i 2 Uk \Ui 



Ui 



%u k 



£ (w.ln Vi - ifiln Ui) 



(13) 



Here every summation is from 1 to S. 

Lemma 2: If c, > i = 1, • • . , N, then subject to the constraint 
£, x, ■ = 1, the function 

F(x) = J c,-ln xt (14) 



attains its unique global maximum when 

Ci 



Xi = 



lei 



(15) 



The proof follows from the observation that by the Lagrange method 



d 
dXi 



F(x) - A I Xi 



= --A = 0. 

Xi 



(16) 



Multiplying by Xi and summing over i gives A = £ c„ hence the result. 

Now in Lemma 1, let S be the number of state sequences of length 
T. For the ith sequence let m, be the joint probability 



Ui = Prob[ state sequence i, observation O | model M] 
-P(i,0|M). 
Let Vi be the same joint probability conditioned on model $L Then 
5>« = />(0|M) AP(M) 



5> = />(0|M) 4P(M) 
and the lemma gives 
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(17) 



where 

Q(M,M) A 2 win w. (19) 

Thus, if we can find a model M that makes the right-hand side of (18) 
positive, we have a way of improving the model M. Clearly, the largest 
guaranteed improvement by this method results for M, which maxi- 
mizes Q(M, M) [and hence maximizes the right-hand side of (18)]. The 
remarkable fact proven in Ref. 13 is that Q (M, M) attains its maximum 
when M is related to M by the reestimation formulas (10) through (12). 
To show this let the sth-state sequence be So, Si, • • • , St, and the given 
observation sequence be O*, , • • • , 0* r Then 

T-\ T-\ 

In v 8 = In p(s, O | M) = In ^ + £ In a S/S , +1 + £ In b 8l JO l+1 ). (20) 

(=0 <-0 

Substituting this in (19) for Q(M, M), and regrouping terms in the 
summations according to state transitions and observed symbols, it 
can be seen that 

N N N M 

Q(M, M) = X 2 c>ln dij+'Z £ d jk \n bj(k) 



Here 



i=l 7=1 


>=i *=i 

N 






+ 2 e,ln mi. 

i=l 


(21) 


S 

dj = £ 

s=l 


p(s,0|M)/i iy (s) 


(22a) 


S 

djk = 2 


p(s, 0\M)mjk(s) 


(22b) 


S 


p(s,0\M)n(s), 


(22c) 



s=l 

and for the sth-state sequence 

fiij(s) — number of transitions from state qi to qj 

mjk(s) = number of times symbol k is generated in state qj 

n(s) = 1 if initial state is qi 

= otherwise. 

Thus, dj, djk, and e, are the expected values of n*,-, rrijk, n, respectively, 
based on model M. 
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The expression (21) is now a sum of 2N + 1 independent expressions 
of the type maximized in Lemma 2. Hence, Q (M, M) is maximized if 

dij = ^- (23a) 

2 c v 
j 

bj(k)=^j- (23b) 

k 

m = ~- (23c) 

i 

These are recognized as the reestimation formulas. 

2.2 Solution by optimization techniques 

Lest the reader be led to believe that the reestimation formulas are 
peculiar to stochastic processes, we shall examine them briefly from 
several different points of view. Note that the reestimation formulas 
update the model in such a way that the constraints 

N 

5>, = 1 (24a) 

j-i 

N 

2 dij =1 for 1 < i < N (24b) 

y'-i 

and 

M 

2 bjk = 1 for 1 2= j < N (24c) 

k=i 

are automatically satisfied at each iteration. The constraints are, of 
course, required to make the hidden Markov model well defined. It is 
thus natural to look at the training problem as a problem of constrained 
optimization of P and, at least formally, solve it by the classical method 
of Lagrange multipliers. For simplicity, we shall restrict the discussion 
to optimization with respect to A. Let Q be the Lagrangian of P with 
respect to the constraints (24b). We see that 

Q = P+ J Aif | ay-lV (25) 

where the A, are the as yet undetermined Lagrange multipliers. 

At a critical point of P on the interior of the manifold defined by 
(24a through c), it will be the case that for 1 < i, j < N 

ii-^+fc-O. (26) 

ddij dciij 
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Multiplying (26) by a tj and summing over j we get 



2 dP 






Xi--Ai-t— , (27) 

oOtj 



where the right-hand side of (27) follows from substituting (24b) for 
the sum of oy and then replacing A, according to (26). From (27) it 
may be seen that P is maximized when 

dP 

aij= ~^w- (28) 

A similar argument can be made for the tt and B parameters. 

While it is true that solving (28) for a,y is analytically intractable, it 
can be used to provide some useful insights into the Baum- Welch 
reestimation formulas and alternatives to them for solving the training 
problem. Let us begin by computing dP/daij by differentiating (3), 
according to the formula for differentiating a product, 

dP T ~ 1 

— - 2 <* t (i)bj(O M )0 t+1 (j). (29) 

oClij t-\ 

Substituting the right-hand side of (29) for dP/datj in (28), we get 

T-l 

X ct t (i)aijbj{0 M )Pt + x(j) 
o-ij = N r _ t . (30) 

£ £ a t (i) ai jbj(O t+1 )p t+1 U) 
j-i t=i 

Then changing the order of summation in the denominator of (30) and 
substituting in the right-hand side of (2) we get 

r-i 

£ at(i)a u bj(Ot + i)P t+ iU) 
au = — yzj . (31) 

2 ou(i)Pt(i) 

1=1 

The right-hand side of (31) is thus seen to be identical to that of the 
reestimation formula (10). Thus, at a critical point, the reestimation 
formula (10) solves the equations (31). Similarly, if we differentiate (3) 
. with respect to in and bj* we get 

^- = £ b i (0 1 )a i M0 2 )MJ) 

= bi(Oi)fa(i) (32) 

SPEECH RECOGNITION 1045 



and 

aP N 

— = 2 Z <*t{i)aijP M (j) + 8(Oi, VkivjfiiU) . (33) 

respectively. In (33) 5 is understood to be the Kronecker 5 function. 

By substituting (32) and (33) into their respective analogs of (28), 
we obtain the reestimation formulas (12) and (11), respectively, at a 
critical point. Thus it appears that the reestimation formulas may 
have more general applications than might appear from their statistical 
motivation. 

Equation (28) suggests that we define a transformation, T, of the 
parameter space onto itself as 

dP 

TUW-T^p (34) 

L x ik - — 
k=,\ oXik 

where T(x), 7 is understood to mean the i, y'th coordinate of the image 
of x under T. The parameter space is restricted to be the manifold 
such that Xij > for 1 < i, j<N and£y=i xy = 1 for 1 < i < N. Thus, 
the reestimation formulas (10), (11), and (12) are a special case of the 
transformation (34), with P a particular homogeneous polynomial in 
the x^ having positive coefficients. Here the Xij include the m, the Oij, 
and the &,*. Baum and Eagon 14 have shown that for any such polyno- 
mial P[T(x)] > P(x) except if x is a critical point of P. Thus the 
transformation, T, is appropriately called a growth transformation. 
The conditions under which T is a growth transformation were relaxed 
by Baum and Sell 17 to include all polynomials with positive coefficients. 
They further proved that P increases monotonically on the segment 
from x to T(x). Specifically, they showed that P[rjT(x) + (1 - tj)x] > 
P(x) for < 7? < 1. Other properties of the transformation (34) have 
been explored by Passman 18 and Stebe. 19 There may be still less 
restrictive general criteria on P for T to be a growth transformation. 

We can give T(x) a simple geometric interpretation. For the purposes 
of this discussion we shall restrict ourselves to x G R^, Xi > for 1 < 
i < N, and the single constraint G(x) = ££i x« - 1 = 0. We do so 
without loss of generality, since constraints such as those of (24a, b, 
and c) are disjoint, i.e., no pair of constraints has any common 
variables. As shown in Fig. 1, given any x satisfying G(x) = 0, T(x) is 
the intersection of the vector X, or its extension, with the hyperplane 

dP 

2-i xi- 1 = 0, where X has components Xi — for 1 < i < N. 

OXi 

This may be shown by observing that a line in the direction of X 
passing through the origin has the equation y = rX, where r is a non- 
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6U) - 2 x/ -1 - - 

i ' 1 



-,U(7J)r(x) + [1-/1(7J))X 




Fig. 1 — Geometrical relationship of the quantities involved in the reestimation for- 
mulas. 



negative scalar. Component-wise this is equivalent to 

dP 
v, = rxi — for 1 < i < N. 
dXi 



(35) 



We can find that r for which y intersects the hyperplane G (x) = by 
summing over i. Thus 

n N dP 

Zyi = rZx, — =1, (36) 

i-=l i-l oXi 

since y lies on the hyperplane G (x) = 0. Rearranging (36) we have 

1 



r = 



£ dP 

,-i dXi 



(37) 



and 



dP 

Xi T~ 
_ dXj 

y ' » dP' 

L xj — 
y=i dXj 



(38) 



Furthermore, as also shown in Fig. 1, the vector [T(x) — x] is the set 
of intersections of the vector (x + tjX) with the hyperplane G (x) =0 
for < 7} < +oo with T(x) corresponding to tj = +oo and x to tj = 0. 

Finally, in view of the result of Baum and Sell, quoted above, the 
vector T(x) — x must also have a positive projection on VP. This, 
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too, is easily seen. If P is a polynomial with positive coefficients, then 
dP/dXi > for 1 < i < N. From the definition of T it is clear that 

aD N ap 

T(x),>x, iff ^->£x, — = r, (39) 

dX, y_i OX/ 

where r is some constant. Then it must be true that 

N /dP \ 

Z\T(x)i-Xi][- r >0 (40) 

«-l \OXi J 

since both factors in each summand are of the same sign. Rearranging 
(40) we have 

I [T(x). -ftlf-irj [T(x), - x,] = 0. (41) 

The right-hand side is zero since ££i T(x), = £&i Xi = 1. Thus 
[T(x) - x]- VP > 0, proving that a step of the transformation has a 
positive projection along the gradient of P. 

This merely guarantees that we can move an infinitesimal amount 
in the direction of [T(x) - x] while increasing P. The theorem of 
Baum and Eagon, however, guarantees much more, namely that we 
can take a finite step and be assured of increasing P. We may, in fact, 
be able to continue past T(x) while still increasing P. We are unable, 
at present, to give a geometrical interpretation of this fact. 

While the reestimation formulas provide an elegant method for 
maximizing P, their success depends critically on the constraint set 
(24a, b, and c). As we will suggest later, in some cases there may be 
advantages in using classical optimization methods. 

The principle of the classical methods is to search along the projec- 
tion of VP on the constraint space, G, for a local maximum. The 
method of Rosen, 24 for example, uses only VP and a crude search 
strategy. The method of Davidon is one of many quasi-Newton tech- 
niques that uses the Fletcher-Powell 25 approximation to the inverse of 
the Hessian of P and an exact line search with adaptive step size. 
There are many collections of general purpose subroutines for con- 
strained optimization that can be used to solve the training problem. 
We have successfully used a version of the Davidon procedure from 
the Harwell Subroutine Library. 26 However, for the constraints that 
it, A, and B be stochastic, the computation can be greatly simplified. 

We illustrate this by outlining the gradient search algorithm for the 
case where P is a function of the variables x h • • • , x N subject to the 
constraints x, ; > for 1 < i < N and £j-i x, - 1 = 0. For convenience 
we will call the last constraint Gi , and the inequality constraints on xi, 
. . . , xn as Ck, • • • , Gn+i, respectively. 

An initial starting point x is chosen and the "active" constraints 
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identified. For our case G\ is always active. For i > 1, G, is active if 
;t,-i = 0. Let G nj , j = 1, • • • , £ be the active constraints (with rii = 1) 
at the initial point. Let Q = P + £^.j \jG nj . Then according to the 
Kuhn-Tucker theorem, 27 the Lagrangian multipliers, A,, are deter- 
mined by demanding that VQ be orthogonal to VG n> for 1 < / < & Now 

VQ = VP + I \jVGn, 

= VP + TX, (42) 

where T is the N X /matrix with r,, = (VG„ y ), = dG nj /Bxi, and A is the 
vector with components A/ for y = 1, • • • , £ Thus the Kuhn-Tucker 
requirement is equivalent to 

T'VQ = (43) 

or, from (42), 

a = -(rrr^'vp. (44) 

For our special constraints we have 

r,i = 1 for 1 < i < N (45) 

and, for jf*l 

fl if i = rij-l 
1 v ~ |0 otherwise. l *°' 

With r defined this way 

rT = | i i o s (47) 

and (rT) -1 may be shown to be 



(48) 



Substituting (48) into (44) gives A. When this A is substituted back into 
(42), it turns out that the resulting vector VQ can be computed by the 
following simple steps: 

(i) Compute VP and let S be the sum of all components of VP 
except (VP)„._„y=2, ... ,d 
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1 1 
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1 
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1 








(ii) Then 

(VQ), = i = 7i„ 7 = 2, • • • , t 

= (VP). otherwise. 

Finally, the values of P are searched along the line 

for a maximum with respect to tj. The procedure is repeated at this 
new point. 

In applying this technique to the actual training problem, there will 
be 2N + 1 stochasticity constraints analogous to G\ and a correspond- 
ing number of positivity constraints analogous to G2, G3, • • • Gn+i- In 
this case we have the option of treating all the parameters and their 
associated constraints together, or we may divide them into disjoint 
subsets and determine search directions for each subset independently. 

Notice that this derivation does not require P to be of any special 
form. This may prove to be an advantage since the Baum- Welch 
algorithm is not applicable to all P. Furthermore, the constraints may 
be changed. Although, as we shall see later, the Baum- Welch algorithm 
can be somewhat generalized in this respect, it does not generalize to 
work with arbitrary linear constraints. 

III. CONSIDERATIONS FOR IMPLEMENTATION 

From the foregoing discussion it might appear that solutions to the 
problems of hidden Markov modeling can be obtained by straightfor- 
ward translation of the relevant formulas into computer programs. 
Unfortunately, for all but the most trivial problems, the naive imple- 
mentation will not succeed for two principal reasons. First, any of the 
methods of solution presented here for either the classification or the 
training problem require evaluation of a t (i) and fr(i) for 1 < t < T and 
1 < i < N. From the recursive formulas for these quantities, (1) and 
(2), it is clear that as T -> °o, a T (i) -> 0, and 0i(i) -► in exponential 
fashion. In practice, the number of observations necessary to ade- 
quately train a model and/or compute its probability will result in 
underflow on any real computer if (1) and (2) are evaluated directly. 
Fortunately, there is a method for scaling these computations that not 
only solves the underflow problem but also greatly simplifies several 

other calculations. 

The second problem is more serious, more subtle, and admits of a 
less gratifying, though still effective, solution. Baum and Petrie 15 have 
shown that the maximum likelihood estimates of the parameters of a 
hidden Markov process are consistent estimates (converge to the true 
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values as T — > oo) of the parameters. The practical implication of the 
theorem is that, in training, one should use as many observations as 
possible which, as we have noted, make scaling necessary. In reality, 
of course, the observation sequence will always be finite. Then the 
following situation can arise. Suppose a given training sequence of 
length T results in bjk = 0. (It is, in fact, possible for a local maximum 
of P to he on a boundary of the parameter manifold.) Suppose further 
that we are subsequently asked to compute the probability that a new 
observation sequence was generated by our model. Even if the new 
sequence was actually generated by the model, it can be such that 
a t -\(i)aij is nonzero for only one value of j and that O t = V/,, whence 
<*t(j) = and the probability of the observation then becomes zero. 
This phenomenon is fatal to a classification task; yet, the smaller T is, 
the more likely is its occurrence. Jelinek and Mercer 9 have dealt with 
this problem in a slightly different context. Here, we offer the much 
simpler solution of constraining the parameter values so that Xy ^ *y 
>0. 

Finally, in this section we discuss the related problem of model 
stability. Baum and Eagon 14 note that successive applications of the 
reestimation formulas converge to a connected component of the local 
maximum set of P. In case there are only a finite number of such 
extrema, the point of convergence is unique to within a renaming of 
the states. The component of the local maximum set to which the 
iteration converges as well as which of the N\ labelings of the states is 
determined by the initial estimates of the parameters. If we wish to 
average several models resulting from several different starting points 
to achieve model stability, we must be able to match the states of 
models whose states are permuted. We have devised a solution to this 
problem based on a minimum-weight bipartite matching algorithm. 28 

3.1 Scaling 

The principle on which we base our scaling is to multiply a t (i ) by 
some scaling coefficient independent of i so that it remains within the 
dynamic range of the computer for 1 < t < T. We propose to perform 
a similar operation on flt{i) and then, at the end of the computation, 
remove the total effect of the scaling. 

We illustrate the procedure for (10), the reestimation formula for 
the state transition probabilities. Let a t (i) be computed according to 
(1) and then be multiplied by a scaling coefficient, c,, where say, 



c t = 



N 

£ cud) 



(50) 



so that gli c,a t (i) = 1 for 1 < t < T. Then, as we compute (3 t (i) from 
(2), we form the product c,p,(i) for T > t > 1 and 1 < i < N. In terms 
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of the scaled forward and backward probabilities, the right-hand side 
of (10) becomes 

£ Ctcu(i)aiMO M )faiU)DM 

, (51) 



Y £ C,a,(0a«vMO, + i)/W<0A + i 



where 



and 



a-n* (52) 



T 
Dt=Y[C T 



This results from the individual scale factors being multiplied together 
as we perform the recursions of (1) and (2). 

Now note that each summand in both the numerator and the 
denominator has the coefficient GDm = IE-i c r . These coefficients 
can be factored out and canceled so that (51) has the correct value o»> 
as specified by (10). The reader can verify that this technique may be 
equally well applied to the reestimation formulas (11) and (12). It 
should also be obvious that, in practice, the scaling operation need not 
be performed at every observation time. One can use any scaling 
interval for which underflow does not occur. In this case, the scale 
factors corresponding to values of t within any interval are set to unity. 

While the above described scaling technique leaves the reestimation 
formulas invariant, (3) and (4) are still useless for computing P. 
However, log P can be recovered from the scale factors as follows. 
Assume that we compute ft according to (50) for t = 1, 2, • • • T. Then 



N 



Ct 2 ar(i) = 1 (53) 

j-i 

and from (53) it is obvious that C T = l/P. Thus, from (52) we have 

n * - 4- w 

The product of the individual scale factors cannot be evaluated but we 
can compute 

T 

log-P=-£ log ft. (55) 

t-\ 

If one chooses to use the Viterbi algorithm for classification, then 
log P can be computed directly from -n, A, and B without regard for 
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the scale factors. Initially, we let <fn(i) = log|>,6,(Oi)] and then modify 
(7a) so that 

+Aj) - max[(^i(i) + log o y ] + log[6,(0,)]. (56) 

In this case log P = max [<t>r(i)]. 

IsisJV 

If the parameters of the model are to be computed by means of 
classical optimization techniques, we can make the computation better 
conditioned numerically by maximizing log P rather than P. The 
scaling method of (50) makes this straightforward. 

First note that if we are to maximize log P, then we will need the 
partial derivatives of log P with respect to the parameters of the 
model. So, for example, we will need 

Substituting the right-hand side of (29) for dP/daij in the right-hand 
side of (57) yields 

d r_1 

— -(logP) = CtI a t (i)bj(O l+1 )p t+1 (j) 
odij t -i 

T-\ 

= 2 C t ou{i)bj(0 M )fi M U)D M 

= S ( II *) «t(i)bj(O t +dp t+i {j) ( n cX (58) 

<=1 V-l / V-M-l / 

So that if we evaluate (29) formally, using not the true values of the 
forward and backward probabilities but the scaled values, then we will 
have the correct value of the partial derivatives of log P with respect 
to the transition probabilities. A similar argument can be made for the 
other parameters of the model and, thus, the scaling method of (50) 
provides a means for the direct evaluation of V(log P), which is 
required for the classical optimization algorithms. Later we shall see 
that the combination of maximizing log P and this scaling technique 
simplifies the solution of the left-to-right Markov modeling problem 
as well. 



3.2 Finite training sets 

We now turn our attention to solving the problems created by finite- 
training-set size. As we noted earlier, the effect of this problem is that 
observation sequences generated by a putative model will have zero 
probability conditioned on the model parameters. Since the cause of 
the difficulty is the assignment of zero to some parameters, usually 
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one or more symbol probabilities, it is reasonable to try to solve the 
problem by constraining the parameters to be positive. 

We can maximize P subject to the new constraints oy > e > 0; 
b Jk > € > 0, most easily using the classical methods. In fact, the 
algorithm described earlier based on the Kuhn-Tucker theorem is 
unchanged except that the procedure for determining the active con- 
straints is based on e rather than zero. 

While the Lagrangian methods are perfectly adequate, it is also 
possible to build the new constraints into the Baum- Welch algorithm. 
We can show how this is done by making a slight modification to the 
proof of the algorithm given earlier (Section 2.1). Recall that the proof 
of the Baum- Welch algorithm was based on maximization of 2N + 1 
expressions of the type maximized in Lemma 2, eq. (14). Since these 
expressions involve disjoint sets of variables chosen from A, B, -rr, it 
suffices to consider any one of the maximizations. In fact, it suffices to 
show how Lemma 2 gets modified. Thus we wish now to maximize 

F(x)-2cfIiiXi (59) 

i 

subject to the constraints 

£ Xi : = 1 (60a) 

i 

and 

*, > e, i = 1, • • • N. (60b) 

(From the following discussion it will be obvious that a trivial gener- 
alization allows e to depend on i.) 

Now without the inequality constraints (60b), Lemma 2 showed that 
F(x) attains its unique global maximum when x, = c,/£, c,-. Suppose 
now that this global maximum occurs outside the region specified by 
the inequality constraints (60b). Specifically, let 

X t --£->e for i = l,...,N-€ (61a) 

So 

<e for i = N-S+l,--,N. (61b) 

From the concavity of F(x) it follows that the maximum, subject to 
the inequality constraints, must occur somewhere on the boundary 
specified by the violated constraints (61b). Now it is easily shown that 
iff, for some i> N- € is replaced by e, then the global maximum over 
the rest of the variables occurs at values lower than those given above. 
From this we conclude that we must set 

Xi = e for i>N-€ (62) 
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and maximize 



F(x) = £ c.ln Xi (63) 



subject to the constraint Yf^i Xi ■ = 1 — £e. But this, analogously to 
Lemma 2, occurs when 

Xi = (1 - «&) j£- i<N-S. (64) 

If these new values of x, satisfy the constraints, we are done. If one or 
more become lower than e, they too must be set equal to €, and <? 
augmented appropriately. 

Thus the modified Baum- Welch algorithm is as follows. Suppose we 
wish to constrain 6y* > e for 1 < j < N and 1 < k < M. We first evaluate 
B using the reestimation formulas. Assume that some set of the 
parameters in the j th row of B violates the constraint so that 6,*,. < c 
for 1 S j < 4. Then set Sj*. = e for 1 < i < / and readjust the remaining 
parameters according to (64) so that 

6 Jk = (I- &)--»- VA^{Ai|l<i</)- (65) 

i-l 

After performing the operation of (65) for each row of B, the resulting 
B is the optimal update with respect to the desired constraints. The 
method can be extended to include the state transition matrix if so 
desired. There is no advantage to treating it in the same manner since, 
for any single observation sequence, m will always be a unit vector with 
exactly one nonzero component. In any case, (65) may be applied at 
each iteration of the reestimation formulas, or once as a post-process- 
ing stage after the Baum- Welch algorithm has converged. 

3.3 Combining models 

The final implementation^ issue that we shall consider in this 
section is that of combining models for improved stability. There are 
several circumstances under which it may be desirable to combine 
several models into one. In spectral estimation, for example, to com- 
pute a long-term average spectrum of a stationary signal, it may be 
convenient to average a number of spectra computed over shorter 
intervals. It seems quite natural to apply similar block-processing 
techniques to the Markov modeling problem if the source is assumed 
to be ergodic. We may, for example, divide a long sequence of obser- 
vations into contiguous subsequences, estimate model parameters for 
each subsequence, and combine the results. 
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Whether or not the source is ergodic, we may still attempt to 
increase the robustness of our model by averaging the parameter 
estimates derived from multiple initial values and/or independent 
observation sequences. 

In any case, the difficulty that will be encountered is that even if 
there are finitely many isolated local maxima of P, they are only 
unique to within a renaming of the states. For two different observation 
sequences, qt and qj may be topologically equivalent, but i ¥> j. We 
might try to avoid this problem by using the final parameter values for 
one observation sequence as the initial values for the next in hopes 
that this will restrict the search to a neighborhood of a single local 
maximum. This method, unfortunately, is not reliable. A better ap- 
proach is that of finding a renaming of the states that niinimizes, in 
some sense, the difference between two models. 

Suppose b,- and b„ 1 < j < N are, respectively, the rows of two 
estimates of B. Let p(j) be a permutation of the state index, j, and let 
rf(.,.) be some distance metric. Then we seek the permutation, p, of 
(qi,Q2, -•• , qN) such that 

D = I d\bj, bpo,] (66) 

7=1 

is minimized. The naive solution is to try all possible N\ permutations 
and select the best one in the sense of (66). However, for N > 10 the 
computation becomes intractable. The problem can be brought within 
reach, however, by transforming it into a minimum-weight bipartite- 
graph-matching problem on 2A7 vertices. In the literature on combi- 
natorial optimization (see, e.g., Ref. 28), several algorithms are avail- 
able for accomplishing such a match in a number of operations that 
grow as N 3 . In Appendix B, we describe one such algorithm based on 
an outline provided to us by R. E. Tarjan. 

IV. LEFT-TO-RIGHT HIDDEN MARKOV MODELS 

For the purposes of isolated word recognition, it is useful to consider 
a special class of absorbing Markov chains that leads to what we call 
left-to-right models. These models have the following properties: 

(i) The first observation is produced while the Markov chain is in 
a distinguished state called the starting state, designated q%. 

(ii) The last observation is generated while the Markov chain is in 
a distinguished state called the final or absorbing state, designated g N . 

(Hi) Once the Markov chain leaves a state, that state cannot be 
revisited at a later time. 

The simplest form of a left-to-right model is shown in Fig. 2, from 
which the origin of the term left-to-right becomes clear. 

In this section we shall consider two problems associated with these 
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a 44 = 1 




7T, = 1 



by k t>2k b 2k 

Fig. 2 — The simplest form of left-to-right model. 

special hidden Markov models. Note that a single, long-observation 
sequence is useless for training such models, because once the state qn 
is reached, the rest of the sequence provides no further information 
about earlier states. The appropriate training data for such a model 
are a set of observation sequences obtained by several starts in state 
q\. In the case of isolated word recognition, for instance, several 
independent utterances of the same word provide such a set. We wish, 
therefore to modify the training algorithm to handle such training 
data. We also wish to compute the probability that a single given 
observation sequence, Ox , O2 • • • , Or, was produced by the model, 
with the assumption that 0\ was produced in state q± and Or in state 
qN. The three conditions mentioned above can be satisfied as follows: 
Condition (i) will be satisfied if we set w = (1, 0, • • • , 0) and do not 
reestimate it. Condition (ii) can be imposed by setting 

Condition (Hi) can be guaranteed in the Baum- Welch algorithm by 
initially setting a,y = for j < i (and in fact for any other combination 
of indices that specify transitions to be disallowed). It is clear from 
(34) that any parameter once set to zero will remain zero. For the 
gradient methods the appropriate ai/s are just set to zero and only the 
remaining parameters are adjusted. 

The modification of the training procedure is as follows: Let us 
denote by O ■ [0 (U , <2> , • • • (/0 ] the set of observation sequences, 
where ( * 1 = OPoP ••• Og is the £th sequence. We treat the 
observation sequences as independent of each other and then we 
adjust the parameters of the model M to maximize 

K 

P= n Prob(0 ( *>|M). 

A-l 

= n p*> (68) 



*-j 
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Since the Baum-Welch algorithm computes the frequency of occur- 
rence of various events, all we need to do is to compute these frequen- 
cies of occurrence in each sequence separately and add them together. 
Thus the new reestimation formulas may be written as 

I I ot(i)avbj(OWi)P k M U) 

*-> '- 1 (69) 



an = 



a ~ k T k -\ 

I I ctHDPHi) 



and 



K 



gy ^fy • (70) 



k=i t-i 



As noted above, it is not reestimated. 

Scaling these computations requires some care since the scale factors 
for each individual set of forward and backward probabilities will be 
different. One way of circumventing the problem is to remove the scale 
factors from each summand before adding. We can accomplish this by 
returning the 1/P factor [which appears in (8) and (9) and was 
cancelled to obtain (10)] to the reestimation formula. Using the rees- 
timation formula for the transition probabilities as an example, (69) 
becomes 

*- "*?!»« • (?1) 

2 7T 1 «Hi)tf(i) 
k-i "h t-i 

If the right-hand side of (71) is evaluated using the scaled values of the 
forward and backward probabilities, then each term in the inner 
summation will be scaled by C?D?+i, which will then be cancelled by 
the same factor which multiplies P k . Thus, using the scaled values in 
computing (69) results in an unsealed 5y. The procedure is easily 
extended to computation of the symbol probabilities. Also note that 
for the purposes of classification only one subsequence is to be consid- 
ered so that either (55) or (56) may be used unaltered to compute P. 

To apply Lagrangian techniques to left-to-right models we note that 
upon taking logarithms of (68) we have 

logP= E log***- < 72 ) 

A-l 

The derivatives needed to maximize log P in (72) can be obtained by 
evaluating expressions for the derivatives of each individual subse- 
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quence and summing. For example, for a,> we have [cf. (57) and (58)] 

d K d r * _1 

— (log P) m £ _ (log P) = C k T I a}(i)bAO&\H# +l (j). (73) 



dou k-\ da 



v 



As in all previous cases an analogous formula may be derived for the 
other parameters. 

In practice, A and B for left-to-right models are especially sparse. 
Some of the zero values are so by design but others are dependent on 
O. Parameters of this type will be found one at a time by standard line 
search strategies. We have found that the convergence of the Lagran- 
gian techniques can be substantially accelerated by taking large enough 
steps so that several positivity constraints become binding. The cor- 
responding variables are then clamped and (65) is applied before 
beginning the next iteration. 

V. NUMERICAL EXAMPLES 

In this section, we give some instructive examples of the behavior of 
several of the algorithms discussed above. The algorithms were all 
coded in FORTRAN 77 on a Data General MV-8000, which uses a 32- 
bit floating point word. The data used in the tests came from either a 
Monte Carlo simulation of a hidden Markov chain or from a portion 
of a newspaper text that was edited to include only the 26 characters 
of the English alphabet and a special character denoting an interword 
space. The simulations have the valuable property that the model is 
known a priori, so that simple models may be used for checking 
program correctness while the more complicated ones can elicit some 
subtle and important numerical and methodological characteristics of 
the algorithms. 

In our experiments we used the following procedure to generate 
observation sequences by means of a random number generator whose 
output is uniform on [0, 1] and specified values of ir, A, B, T, and V: 

(i) Partition the unit interval proportionally to the components of 
77. Generate a random number and select a start state, <#, according to 
the subinterval in which the number falls. Set t = 1. 

(ii) Partition the unit interval proportionally to the components of 
the ith row of B. Generate a random number and select a symbol, Vk, 
according to the subinterval in which the number falls. Set O t = Vk. 

(Hi) Partition the unit interval proportionally to the components of 
the ith row of A. Generate a random number and select the next state, 
qj, according to the subinterval in which the number falls. 

(iv) Increment t. If t < T set q, = qj and repeat (ii) through (iv); 
otherwise stop. 

Using this observation generator, several two- and three-state Mar- 
kov models were simulated. These simulations were used to verify that 
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the parameter estimation algorithms were working correctly and to 
study the effects of the scaling interval on the accuracy of the algo- 
rithms. In this study we found that all scaling intervals that were 
sufficiently short to prevent underflow yielded numerically identical 
results. Thus one can, at one extreme, scale the forward and backward 
probabilities after each observation or, at the other, wait until a 
threshold signaling that underflow is imminent is exceeded and only 
then perform the scaling operation. 

We next proceeded to study a pair of four-state (N = 4), four-symbol 
(JVf = 4) models shown below and referred to as SRC44 and SRC45. 

SRC44: 
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tt = [0.25 0.25 0.25 0.25] 
and 
SRC45: 
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tt = [0.25 0.25 0.25 0.25]. 

The state transition diagrams for these models are shown in Fig. 3. 

Model SRC44 is a balanced model in the sense that all permissible 
transitions and symbols are equally likely, whereas model SRC45 is 
skewed in that it distinctly favors some transitions and symbols over 
others. 




Fig. 3— The four-state model used for testing. 
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For each of these sources we processed observation sequences rang- 
ing in length from 100 to 4000 with the Baum-Welch algorithm. Initial 
values for A, B, and it were chosen at random and the algorithm was 
terminated in one of two ways; either when the change in log P from 
one iteration to the next fell below an arbitrary threshold, or when the 
number of iterations exceeded a specified maximum value. The maxi- 
mum number of iterations was varied from 100 to 1000. For each 
estimate, B, of the source matrix B, a measure of estimation error 



113-2*11- 






bjk - bp(j )k 



T2-. 1/2 



(74) 



was computed, where p(j) is the state permutation that minimizes the 
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-49.9 
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m -46.6 
O 

•a -io.5 



<s 



-48.4 
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NUMBER OF OBSERVATIONS 




1000 
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Fig. 4 — Estimation error as a function of number of observations for source SRC44 
for: (a) 100 iterations maximum, (b) 200 iterations, (c) 400 iterations, and (d) 10 random 
initial starts with 200 iterations maximum. 
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estimation error. The technique of minimum bipartite matching (see 
Appendix B) was used todetermine the optimum state permutation. 

Plots of the quantity \\B - B|| (on a log scale) versus T, the number 
of observations, are given in Fig. 4 for source SRC44. Separate results 
are shown for 100 iterations maximum (part a) of the B Wreestimation 
procedure, 200 iterations (part b), and 400 iterations (part c). Also 
shown in Fig. 4d are the results of using 10 random initial starts with 
200 iterations maximum. Shown in Fig. 4d are the maximum and 
minimum estimation errors for each T and the estimation error for the 
average of all 10 B matrices. (The reader should note that T goes to 
4000 in parts a through c, but only to 1000 in part d of Fig. 4.) 

The curves given in Fig. 4 show several very interesting properties 
of the reestimation procedure. First we see that as the number of 
observations increases, a slow decrease in the average estimation error 
is obtained. However, we can see that statistical fluctuations (owing to 
different initial guesses for the model parameters) are often of larger 
magnitude than the slowly decreasing components of the curve. As the 
maximum number of iterations increases, the magnitude of the statis- 
tical fluctuations decreases, especially for larger values of T. 

The curves of Fig. 4d show that although there is a wide range in 
the value of the estimation error for multiple starting choices, aver- 
aging the B matrices (after appropriate state alignment) leads to 
estimation errors comparable with the best single estimates. 

Figure 5 shows a similar set of results for the Markov source SRC45. 
Figure 5a shows a curve of estimation error versus number of obser- 
vations for a maximum of 400 iterations, Fig. 5b shows the same curve 
with a maximum of 1000 iterations, Fig. 5c shows the curve when the 
initial estimates of both A and B are set to the source values exactly, 
and Fig. 5d shows maximum and minimum estimation errors for 10 
random starting points. 

Although the general trends of the data in Fig. 5 are similar to those 
of Fig. 4, there are several key differences. From Fig. 5b it can be seen 
that even for 1000 iterations, the variation in model estimates is 
enormous (36-dB variations). This result suggests that it is significantly 
more difficult to estimate parameters of a skewed Markov model than 
those of a fairly uniform model. The curves of Fig. 5c, in which the 
initial conditions were set to the source generator values, show that 
extremely good solutions could be obtained if the reestimation proce- 
dure could start in the neighborhood of the "exact" solution. Obviously 
this situation (i.e., starting near the correct parameter values) is not 
enforceable for real data. 

The curves of Fig. 5d, in which multiple estimates of the Markov 
model are averaged, show that averaging the individual parameter 
estimates does not lead to a low error estimate for SRC45. This is 
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Fig. 5— Estimation error as a function of number of observations for source SRC45 
for: (a) 400 iterations maximum, (b) 1000 iterations maximum, (c) initial estimates of A 
and B set to source values, and (d) maximum and minimum estimation errors for 10 
random starting points. 



undoubtedly because of the parameter estimates with high errors that 
occur and which have an undue influence on the average. 

5. 1 Left-to-right Markov source estimation 

The second series of experiments dealt with the left-to-right Markov 
models, as would be appropriate for our intended application to iso- 
lated word recognition. Figure 6 shows four such models. For each of 
these models (denoted as SRC 195, SRC295, SRC395, and SRC495 in 
Fig. 6), the specifications were: 



N=5,M=9,tt= {1,0,0,0,0} 
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Fig. 6— Left-to-right models used for testing for: (a) SRC195, (b) SRC295, (c) SRC395, 
and (d) SRC495. 



and 



0.7 


0.3 





























0.8 


0.2 





























1 





























0.2 


0.8 





























0.3 


0.7. 



B = 



The state transition probabilities were those shown in Fig. 6. The 
SRC 195 model is a left- to-right model. The SRC295 model allows a 
transition between states 1 and 3 and states 3 and 5, as well as 
transitions between sequentially numbered states. Both the SRC395 
and SRC495 models include states whose self-transition probabilities 
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(an) are very small. We will see below that the average occupancy of 
such states is only 1 to 2 observations. For these non-ergodic models 
the concept of occupancy of the transient (i.e., non-absorbing) states 
is important. 

If we denote the probability of a transition from a state to itself as 
p, then the probability of a transition out of that state at time T + 1 
(assuming the state was entered at time t = 0) is 

Prob(?« at t - 0, 1, 2, .... T and g> * q t at T + 1) = p T (l - p) 

for models of the form of SRC 195. Hence the average occupancy of a 
state is given by 

d= l (t+l)p l (l-p) 
t-i 



1-p (75) 

For p = 0.9 we get d = 10, for p = 0.8 we get d = 5, for p = 0.5 we get 
d = 2, and for p = 0.1 we get d = 1.1. Standard formulas are available 
for computing average state occupancies for arbitrary transition ma- 
trices. We will not consider them here; however, it is clear that states 
2 and 4 in models SRC395 and SRC495 are of low occupancy. 

To test the reestimation procedure on the Markov sources of Fig. 6, 
a set of K sequences were generated for each model, where K was the 
set (10, 25, 50, 100). Each sequence was generated using the Markov 
sequence generator described earlier, modified slightly to ensure that 
each sequence terminated in state 5 (the final state) and stayed there 
for 5 observations. The sequences were, however, of variable duration, 
depending on the exact sequence of state transitions that occurred. 

The results showed that for sequences generated from model 
SRC 195, the correct model parameters (to within small estimation 
errors) were obtained for all values of K from 10 to 100. For sequences 
generated from model SRC295 only the 10 observation training se- 
quence yielded grossly incorrect model parameter estimates. All other 
sequences (K = 25, 50, 100) yielded the correct parameter values. 

For both source models SRC395 and SRC495, however, no com- 
pletely correct parameter estimates were obtained. In particular, the 
states whose expected occupancy was small (i.e., states 2 and 4 in both 
models) were merged with either the preceding or the following state 
(or both), while other states whose expected occupancy was larger 
were often split into 2 states, each with the same set of output symbols. 
These experiments indicate that it is difficult to reliably estimate 
parameters of a state, in a left-to-right model, whose average occu- 
pancy is very much smaller than that of the states to which it is 
connected. 
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5.2 Tests on a non-Markov source 

In the experiments described above, the observations were in fact 
generated by a known probabilistic function of a Markov chain. A 
more difficult test of these techniques was performed in which the 
observations were 5000 characters from a newspaper article edited to 
contain only the letters of the alphabet and spaces. We used the 
observation sequence to train a four-state, 27-symbol model. Of course, 
the "true" underlying model is not known, as it was in the tests 
previously described. Nor is it likely that the four-state model is 
complex enough to model the richness of structure of written English. 
Even if it were, it is unlikely that 5000 characters is sufficient to 
capture the structure. Unfortunately, these are exactly the limitations 
with which the experimenter will be faced in trying to model "real" 
processes. Our hope was that the text analysis problem would reveal 
some of the ambiguities that will be encountered in making hidden 
Markov models of natural phenomena. 

The text was first analyzed using the Baum- Welch reestimation 
formulas with a randomly chosen starting point. The algorithm con- 
verged in 310 iterations with log P = -1.317 X 10 4 . For purposes of 
comparison, we analyzed the same data with a quasi-Newton optimi- 
zation routine, VE01A, from the Harwell Library. 26 It required 125 
iterations to obtain a maximum value of log P of -1.356 x 10\ In this 
case some care was required with the parameter values. The finite 
precision arithmetic occasionally results in a parameter value of —10" , 
which appears to satisfy the positivity constraints. Such a value is fatal 
to the computation of log P since it will result in an attempt to take a 
logarithm of a negative quantity. Fortunately, this condition is readily 
detected in the scaling routine and corrected by setting the offending 
parameter to zero. 

Finally, we applied the Lagrangian technique described earlier to 
the same observation sequence but with a different set of initial 
parameter values. After 136 iterations, a still different model with 
log P = -1.327 X 10 4 is obtained. These results illustrate some 
important features of hidden Markov modeling. The computational 
methods used to obtain the models are roughly equivalent. All of the 
resulting models capture some of the structure of the data being 
analyzed. There are many different possible models with very little 
evidence for selecting the "best" one. For even very simple models, 
the likelihood function is too complicated to attribute the selection of 
one model or another by one algorithm or another to its properties. 
Finally, we note that all of the algorithms tested make large improve- 
ments to P during the early iterations and only slight incremental 
improvements later. In fact, the last half of the iterations provides no 
significant change to the model. We have used a convergence criterion 
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of 10" 7 on the relative increment between iterations. This may be 
relaxed substantially, resulting in many fewer iterations with no at- 
tendant degradation in the model. 

VI. CONCLUSION 

We have presented some of the salient theoretical and practical 
issues associated with modeling data by probabilistic functions of a 
Markov chain. In our presentation we have concentrated on three 
issues: alternatives to the Baum- Welch reestimation algorithm, critical 
facets of implementation, and behavior of Markov models on certain 
artificial but realistic signals. 

We have observed that, while most of the discussion of parameter 
estimation for Markov models in the open literature is devoted to the 
Baum- Welch algorithm, classical optimization techniques are not only 
a viable alternative but may even be preferable in some cases. In 
particular, classical techniques are virtually unrestricted by the forms 
of either the likelihood function or the constraints. The reestimation 
formulas may be growth transformations for a wider class of functions 
and constraints than has heretofore been proven; however, it is not 
likely that a universal reestimation formula exists. For applications to 
continuous density functions (c.f. Liporace 29 ), the classical techniques 
may have still other advantages. 

The open literature has provided only a perfunctory, if any, discus- 
sion of some crucial numerical and implementational problems asso- 
ciated with Markov modeling. We have given details of methods of 
dealing with floating point loss of significance, finite-training-set size, 
and model stability. Wherever possible we have made our techniques 
formal and algorithmic. 

Finally, we have given several examples of the behavior of Markov 
modeling techniques on some reasonably realistic data. The most 
important lesson that can be drawn from these experiments is that 
even under ideal conditions (i.e., when the data are associated with a 
known hidden Markov process) and all the more so under realistic 
conditions, the computed models may contain artifacts and may not 
faithfully represent the inherent structure of the data. Thus, great 
caution and empirical validation is required in using these techniques. 

Despite this caveat, hidden Markov models may be beneficial in 
studying many diverse problems. In our companion paper 21 we recount 
a successful application of this body of theory to a problem in auto- 
matic speech recognition. 
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APPENDIX A 

Several of the formulas derived in the text are much more compact 
in matrix notation. Let ' denote matrix transposition, as usual, and let 
the column vectors m and 1, and the matrices A, B t , t = 1, • • • , The 
defined as in the text. Also let a t and /?, be column vectors with 
components a t (i), i = 1, • • • , N and p t (i), i = 1, • • • , N, respectively. 
Then the recursion for a t is 

ot+i = B t+ iA'a t , t = 1, . • • , T- 1. (76) 

The recursion for fi t is 

p t = AB t+l t+1 t=T-l, ...,1. (77) 

The starting values are 

Oti = BiTT 

Pt = 1. (78) 

The probability P is given by 

P = P'toct for any t in (1, T). (79) 

The special cases t = 1 and t = T give 

P = ir'B^ (80) 

and 

P = VctT 

= VBtA'Bt-i • • • A'Biw. (81) 

In each of these formulas P can be regarded as the trace of a 1 X 1 
matrix, which [as expanded in (81)], is a product of several matrices. 
The fact that the trace of a product of matrices is invariant to a cyclic 
permutation of the matrices can be used to advantage in finding the 
gradient of P. Define VaP as the matrix whose if th component is 
dP/ddij. Similarly, define V B P and V„P. Then it is straightforward to 
show that 

VJ> = B& 

T-l 

VaP= 2 a,# +1 £ m 
t-i 

(V B P)j k = I (A'a t -MPt)j. (82) 

oo,-k 
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In the last equation, if Oi = v k then the corresponding term in the 
sum is just ir' ft i. 

APPENDIX B 

As we mentioned in Section 3.3, it is frequently necessary to compare 
(or average) two different estimates B and B of the symbol matrix. 
The optimization procedures, in general, relabel the states; therefore 
the rows of B may, in general, be permuted relative to those of B. 
Before comparison, therefore, the optimum permutation must be 
found. This is defined as one that minimizes the distance D defined in 
eq. (66). The problem of finding this permutation can be converted to 
a network optimization problem called "bipartite weighted matching." 
To this end define Uty as the distance of the ith row of B from they'th 
row of B. As we have done in Fig. 7, draw two sets P and Q of N 
vertices each. For 1 < i,j < N, draw an edge from the ith vertex in P 
to the j th vertex in Q, and label this edge with the weight wy. The 
resulting graph is a complete weighted bipartite graph, and the prob- 
lem is to find an N-match (i.e., one to one matching with N edges) 
such that it has minimum weight. 

Suppose that Z k is a A-match (k < N). With respect to this match 
make the following definitions: 




---FREE VERTEX 



Fig. 7— Complete bipartite weighted graph (N = 4) and a 2-match (matched edge 
— »). Path shown in heavy lines is an alternating path for the 2-match. It also happens to 
be an augmenting path. 
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(i) A matched edge is an edge in Z*. 

(ii) A free vertex is one that is not on a matched edge. 

(Hi) An alternating path is a path along edges that alternately 
belong to Z* and do not belong to Z*. [N.B. the number of matched 
edges, m, on an alternating path may have any value m < k. The 
degenerate case m = is also valid.] 

(iv) An augmenting path is an alternating path between two free 
vertices. Again note that a single edge connecting two free vertices is 
a valid augmenting path. 

An augmenting path has the structure 

PiUqMpzUqi ••• Uqj. (83) 

Here U represents an unmatched edge, M represents a matched edge, 
andpi and qj are the only free vertices on the path. (Here/?, is the ith 
P- vertex along the path. The qi are similarly defined.) Note that the 
number of U's on an augmenting path is exactly one more than the 
number of M's. Hence the total number of edges in an augmenting 
path is always odd. 

Suppose we are given a £-match Z* and an augmenting path ap of 
length 2/ + 1. Then we can obtain a (k + l)-match Z*+i by a 
complementary labeling of the edges of ap (i.e., by changing every U 
to M and vice versa). If W\ t Ufe, • • • , Wy+i are the weights along ap, 
then the weight Z*+i exceeds that of Z* by the amount W\ — W2 + 
w 3 — • • • + W2J+1. 

This method of obtaining a (k + l)-match from a A-match has the 
following key property (proof given below): Suppose M* is a minimum- 
weight A-match. Let apm be an augmenting path for M* with minimum 
incremental weight. Then the match M*+i obtained from M* and apm 
is a minimum weight (k + 1) -match. 

Assuming this property for the moment, a minimum-weight N- 
match can be determined by the following iV-step algorithm: 

For k = 1, 2, • • • , N generate M* by finding an optimum augmenting 
path for M*_i. (Note that Mo is an empty set.) 

We now show that finding an optimum augmenting path is equiva- 
lent to a shortest path problem. With reference to Fig. 8 let us generate 
a directed graph by directing each edge in M* to the left and each 
unmatched edge to the right. Also let us multiply the weights of all 
matched edges by — 1. Finally, add two vertices labeled s and t. 
Connect 8 to all free vertices in P by right-going edges of zero weight. 
Similarly, connect all free vertices of Q to t by right-going edges of 
zero weight. In this directed graph any path from s to t is an aug- 
menting path for the matching M*-i. Hence, if we interpret the weights 
as lengths, it is clear than an optimum augmenting path is a shortest 
path from 8 to t. 
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Fig. 8— Directed graph obtained from Fig. 7. Every path from s to t is an augmenting 
path of the 2-match (except for the dummy edges of weight from s and to t). 
Interpreting weights as lengths, the length of a path from s to t is the incremental weight 
of the corresponding augmenting path. 

A shortest path problem can be solved in polynomial time. However, 
the problem can be solved much more easily and efficiently if the path 
lengths are all nonnegative. In that case the problem can be solved in 
N 2 time by Dijkstra's method. 28 

It is possible to avoid negative distances in our problem by the 
method of assigning a "potential" f(v) to each vertex. Suppose that at 
step k the graph has nonnegative weights, and the edges corresponding 
to M*_i have zero weight. Then use Dijkstra's method to find shortest 
paths to all vertices (including t) from s. Define f(v) for the vertex v 
as its shortest distance from s. Next, modify the weight Uty of the edge 
from i to j to 

w'u^Wij + fiD-fU)- (84) 

It is easily seen that this procedure leaves all weights nonnegative, 
does not alter shortest paths, and all shortest paths have weight 0. 
Reversing a shortest path from s to t gives us a matching M*, and the 
new graph has nonnegative weights and zero weights for the matched 
edges. Now Mo trivially has the postulated properties. Therefore, at 
every step the graph will have these properties. 

An important property of the above procedure is that if a vertex p 
(or q) is on some matched edge in M k , then it will be on some matched 
edge in M*+i also. In writing the actual computer program, this 
property simplifies the bookkeeping considerably. 

Another important property concerns the vertex potentials. Suppose 
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Fig. 9 — Types of paths in the symmetric difference between M* and N* +] . 
There must be exactly one more path of type (iv) than of type (Hi). 



M* 

N* + r 



F(v) is the sum of the potentials assigned to vertex v at each of the N 
steps. Suppose Wy and dy are the original and final weights, respec- 
tively, of the edge connecting vertex i to vertex 7. Then 



dij = w u + F(i) - F(j) 2= 0, 



(85) 



and dij = for every edge in the final iV-match. The numbers F(u) 
thus provide a simple proof that the final match is indeed an optimum 
match. 

We turn now to a proof of the key property mentioned above. Let 
M* be an optimum £-match and let Nk+i be any optimum k + 1-match. 
Then we will show that there is a k + 1-match obtained from an 
augmenting path of M* such that its weight is equal to that of Nm+i. 
For this purpose define S as the set of edges in the symmetric difference 
of M* and N*+i. (Recall that the symmetric difference of two sets, A 
and B, is the set of elements that belong either to A or to B but not to 
both.) 

From the geometry of a bipartite graph three properties of the set 
S are obvious: 

(i) The number of edges in S which belong to N/t +] must exceed 
the number that belongs to M* by exactly 1. 

(ii) The edges on any path in S must alternate between M* and 
N* + i. 
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(Hi) A vertex on an edge in S cannot be shared with any edge in M* 
or N*+i that does not belong to S. 

From the first two of these properties it follows that there can be four 
types of paths in S (see Fig. 9). 

(i) Circuits, each consisting of an even number of edges; 
(ii) Open paths, each with an even number of edges; 

(Hi) Open paths, each with an odd number of edges beginning with 
an edge in M*; 

(iv) Open paths, each with an odd number of edges beginning with 
an edge in N*+i. The number of such paths must be exactly one more 
than the number of paths of type (Hi). 

It is easily seen that the incremental weight of every path of type (i) 
or (ii) must be exactly zero. (If it is negative then the weight of M* can 
be decreased by a complementary labeling of the path; if positive then 
the weight of N A +i can be decreased in the same way. But this 
contradicts the hypothesis that M* and N*+i are optimum matches.) 

In view of this and property (3), N*+i can be modified by replacing 
its edges on all paths of types (i) and (ii) by the corresponding edges 
of M*. The modified k + 1-match has exactly the same weight as that 
of Nk+i, however, the symmetric difference between M* and the mod- 
ified k + 1-match has no paths of type (i) or (ii). 

The same argument applies to pairs of paths, one path each of types 
(Hi) and (iv). Thus a final modified k + 1 match is obtained whose 
symmetric difference with M* is exactly one path of type (iv). The 
weight of this final modified k + 1-match is exactly the same as that 
of the original N*+i, and is obtained from an augmenting path of M*. 

We have written a subroutine that implements the above procedure. 
The timing, from a number of test runs, is approximately 0.063iV 3 ms 
central processing unit (CPU) time on the MV-8000. Thus for N = 40 
about 4 seconds of CPU time is needed. 
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